-
Notifications
You must be signed in to change notification settings - Fork 164
fix(BA-2404): Generalize gpu_allocated field with improved SlotName and ResourceSlot typing #6087
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
TypeError, NameError, and AttributeError indicate logical bugs rather than transient failures. Note: SQLAlchemy's StatementError is also a subclass of TypeError.
dbb23e9 to
4aab698
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR refactors the SlotName type from a simple NewType to a UserString class with parsing capabilities, and updates ResourceSlot to properly handle string keys and normalize values. The changes improve type safety and consistency in GPU resource allocation tracking across different accelerator types (CUDA devices, shares, MIG variants, and NPUs).
Key Changes
- Converted
SlotNamefromNewTypeto aUserStringclass with lazy parsing and accelerator detection capabilities - Enhanced
ResourceSlotto properly normalize keys to strings and process raw values (int, float, str, Decimal, BinarySize) into Decimal - Updated GPU allocation tracking to use generic accelerator detection instead of hardcoded CUDA-specific slot names
- Added serialization support for
SlotNameand improved JSON encoding for test fixtures
Reviewed Changes
Copilot reviewed 23 out of 23 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| src/ai/backend/common/types.py | Core refactoring: SlotName class implementation and ResourceSlot normalization improvements |
| src/ai/backend/common/msgpack.py | Added msgpack serialization support for SlotName |
| src/ai/backend/common/utils.py | Added debugging utility pprint_with_type for type inspection |
| src/ai/backend/common/resilience/policies/retry.py | Added default non-retryable exceptions to retry policy |
| src/ai/backend/manager/models/resource_usage.py | Updated GPU allocation to use SlotName.is_accelerator() |
| src/ai/backend/manager/repositories/user/repository.py | Updated GPU stats aggregation to sum all accelerator slots |
| src/ai/backend/manager/repositories/group/repository.py | Updated container stats to use generic accelerator detection |
| src/ai/backend/manager/repositories/resource_preset/repository.py | Added str() conversion for dict keys |
| src/ai/backend/manager/models/image.py | Added str() conversion for resource limit keys |
| src/ai/backend/manager/api/etcd.py | Added str() conversion for API response keys |
| src/ai/backend/manager/idle.py | Improved variable naming from val to slot_val/util_val |
| src/ai/backend/manager/sokovan/scheduler/selectors/concentrated.py | Added type hint for sort_key list |
| src/ai/backend/agent/resources.py | Updated to use ResourceSlot type and SlotName class |
| src/ai/backend/agent/alloc_map.py | Added str() conversions for fnmatch operations |
| src/ai/backend/agent/docker/agent.py | Changed from dict unpacking to slots.copy() |
| src/ai/backend/agent/dummy/agent.py | Changed from dict unpacking to slots.copy() |
| src/ai/backend/agent/kubernetes/agent.py | Changed from dict unpacking to slots.copy() |
| src/ai/backend/agent/stage/kernel_lifecycle/docker/resource.py | Changed from dict unpacking to slots.copy() |
| src/ai/backend/accelerator/cuda_open/plugin.py | Added str() conversion for slot_name in metadata |
| tests/manager/conftest.py | Enhanced fixture_json_encoder to handle ResourceSlot and Decimal |
| tests/manager/repositories/keypair_resource_policy/test_keypair_resource_policy.py | Fixed test data to use string values for resource slots |
| tests/common/test_types.py | Added comprehensive tests for SlotName parsing and BinarySize/Decimal conversion |
| changes/2404.fix.md | Added changelog entry |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| DEFAULT_NON_RETRYABLE_ERRORS: Final = ( | ||
| TypeError, | ||
| NameError, | ||
| AttributeError, | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need to be provided publicly?
resolves #5926 (BA-2404)
Core changes:
SlotNameto be aUserStringsubclass which lazily parses the slot name format.{device_name}.{major_type}[:{minor_type}]gpu_allocatedcalculation to useSlotName.is_accelerator()method to sum all allocated acceleratorsResourceSlotto useUserDictwith proper generic type argumentsstras the key type, but this should be updated toSlotNamein future work (BA-2628; Migrate Manager API to Pydantic-based Request/Response Models #6173)Warning
When there are multiple different accelerators allocated in a single container or installed in a single agent node, this simple "sum" for
gpu_allocatedmay have a non-sense value when mixing different units and fraction scales. We need some design discussion here.This is kept as a known issue because the previous implementation was same in this sense.
Checklist: (if applicable)
SlotNameparsing implementation📚 Documentation preview 📚: https://sorna--6087.org.readthedocs.build/en/6087/
📚 Documentation preview 📚: https://sorna-ko--6087.org.readthedocs.build/ko/6087/